D142-D147 Nucleic Acids Research, 2014, Vol. 42, Database issue 
doi:10.1093/nar/gkt997



Published online 4 November 2013 



JASPAR 2014: an extensively expanded and updated 
open-access database of transcription factor 
binding profiles 

Anthony Mathelier 1, Xiaobei Zhao 2,3, Allen W. Zhang 1, François Parcy 4,
Rebecca Worsley-Hunt 1, David J. Arenillas 1, Sorana Buchman 2, Chih-yu Chen 1,
Alice Chou 1, Hans Ienasescu 2, Jonathan Lim 1, Casper Shyr 1, Ge Tan 4, Michelle Zhou 1,
Boris Lenhard 5,6,*, Albin Sandelin 2,* and Wyeth W. Wasserman 1,*

1 Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics at the Child and Family
Research Institute, University of British Columbia, Vancouver, BC, Canada, 2 Department of Biology and Biotech
Research and Innovation Centre, The Bioinformatics Centre, Copenhagen University, Ole Maaloes Vej 5,
DK-2200, Denmark, 3 Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC
27599, USA, 4 Laboratoire Physiologie Cellulaire & Végétale, Université Grenoble Alpes, CNRS, CEA, iRTSV,
INRA, 38054 Grenoble, France, 5 Computational Regulatory Genomics, MRC Clinical Sciences Centre, Imperial
College London, Du Cane Road, London W12 0NN, UK, and 6 Department of Informatics, University of Bergen,
Thormøhlensgate 55, N-5008 Bergen, Norway

Received September 15, 2013; Accepted October 3, 2013 



ABSTRACT 

JASPAR (http://jaspar.genereg.net) is the largest
open-access database of matrix-based nucleotide
profiles describing the binding preference of
transcription factors from multiple species. The fifth
major release greatly expands the heart of JASPAR,
the JASPAR CORE subcollection of curated,
non-redundant profiles, with 135 new curated
profiles (74 in vertebrates, 8 in Drosophila
melanogaster, 10 in Caenorhabditis elegans and 43 in
Arabidopsis thaliana; a 30% increase in total) and
43 older updated profiles (36 in vertebrates, 3 in
D. melanogaster and 4 in A. thaliana; a 9% update in
total). The new and updated profiles are mainly
derived from published chromatin
immunoprecipitation-seq experimental datasets. In
addition, the web interface has been enhanced with
advanced capabilities in browsing, searching and
subsetting. Finally, the new JASPAR release is
accompanied by a new BioPython package, a new R tool
package and a new R/Bioconductor data package to
facilitate access for both manual and automated
methods.

INTRODUCTION 

Transcription factors (TFs) influence gene expression by
binding to specific cis-acting elements in a genomic
sequence. Thus, accurate models for describing the
binding properties of TFs are essential in modeling
transcription. From a set of known transcription factor
binding sites (TFBSs) for a given TF, the binding
preference is generally represented in the form of a
position weight matrix (PWM) (also called position-specific
scoring matrix) derived from a position frequency matrix
(PFM). A PFM is essentially an occurrence table,
summarizing the number of each nucleotide observed at
each position of a set of aligned TFBSs (1,2). Compared
with simpler models like consensus sequences, PWMs
allow for an additive probabilistic description of binding
preferences (3).
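The relationship between a PFM and a PWM can be illustrated with a minimal sketch. The pseudocount value, the uniform background and the toy counts below are assumptions made for illustration; they are not taken from the JASPAR procedures.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=0.8, background=None):
    """Convert a position frequency matrix into log-odds weights.

    pfm maps each nucleotide to a list of counts, one per position.
    The pseudocount and the uniform background are illustrative assumptions.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES)
        for b in BASES:
            # Share the pseudocount out according to the background.
            freq = (pfm[b][i] + pseudocount * background[b]) / (total + pseudocount)
            pwm[b].append(math.log2(freq / background[b]))
    return pwm


def score(pwm, site):
    """Additive score: the sum of the per-position weights of a site."""
    return sum(pwm[base][i] for i, base in enumerate(site))


# Toy PFM built from 10 hypothetical aligned sites of length 4.
toy_pfm = {
    "A": [8, 0, 1, 2],
    "C": [1, 0, 1, 3],
    "G": [1, 10, 0, 3],
    "T": [0, 0, 8, 2],
}
toy_pwm = pfm_to_pwm(toy_pfm)
best = score(toy_pwm, "AGTC")   # most frequent base at every position
worst = score(toy_pwm, "TAGA")  # rarely observed bases
```

Because the score is a sum over positions, each position contributes independently, which is the additive property that a consensus sequence cannot express.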

The JASPAR database holds collections of PFM
nucleotide profiles based on published experiments from
diverse sources, and has grown gradually from its
inception (4-7). The most widely used JASPAR collection is
JASPAR CORE, which is a curated non-redundant set of
TFBS profiles for multicellular eukaryotes, based on
experimental evidence. The JASPAR database aims to
provide the best canonical DNA binding profile per TF,
as assessed by expert curators. Non-redundancy of TFBS
profiles (i.e. one profile per TF) is intended with the
exception of cases in which curators observe a clear
difference in the sequence (e.g. Nkx2-5) or length (e.g.
JUND) at the core of a profile. Other JASPAR motif
collections, with different characteristics than the CORE
database, are available (7).

Over the years, JASPAR has been equipped with
functions aimed at casual and power users. The web-based
graphical user interface functionality includes browsing,
searching, subsetting and downloading, as well as basic
sequence searching tools, dynamic clustering of matrices
and generation of random PFMs by sampling selected
profiles (4-7).

Historically, JASPAR was populated by PFMs
generated by in vitro site selection assays or collections
of in-depth characterized sites, limiting both the number
of TFs with binding profiles and the number of sites
contributing to the profiles. With the development of
high-throughput techniques that can assess in vitro or
in vivo binding (8-10), it is now possible to generate
binding models for most regulators, in multiple species.
To this end, we have, in this fifth release, expanded the
JASPAR CORE collection substantially and updated
several existing profiles with new data from
high-throughput experiments.

EXTENSIVE EXPANSION AND IMPROVEMENT OF 
JASPAR CORE 

The JASPAR CORE database has been substantially
expanded. In total, 135 new PFMs have been added
(a 30% increase), and 43 older PFMs (9% of last
release) have been updated with new data, from
vertebrate, insect, nematode and plant species (Table 1).
These additions are described in more detail below.

We compiled published sequence-specific DNA binding
TF chromatin immunoprecipitation (ChIP)-seq data
collections into the PAZAR database (11,12) along with TF
ChIP-seq datasets from the ENCODE (13-15) and
modENCODE (16,17) consortia for Homo sapiens, Mus
musculus, Drosophila melanogaster and Caenorhabditis
elegans. From these studies, we extracted the bound
regions, identified over-represented motifs close to the
ChIP-seq peak max position (corresponding to the
region where the maximum number of ChIP-seq reads
are mapped) using the MEME suite (18) and constructed
PFMs describing the binding preferences of the TFs (see
Supplementary Text for details).
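The notion of a peak max position can be sketched as follows. This is only an illustration of selecting a window around the position with the most mapped reads; the function name, the flank size and the clipping to the peak boundaries are assumptions, and the actual procedure is the one given in the Supplementary Text.

```python
def peak_max_window(coverage, peak_start, flank=50):
    """Return (peak_max, start, end) for a window centred on the peak max.

    coverage: per-base read counts across one peak (list of ints).
    peak_start: genomic coordinate of coverage[0].
    flank: hypothetical number of bases kept on each side of the peak max.
    The window is half-open [start, end) and clipped to the peak itself.
    """
    # Index of the first position carrying the maximum number of reads.
    offset = max(range(len(coverage)), key=coverage.__getitem__)
    peak_max = peak_start + offset
    start = max(peak_start, peak_max - flank)
    end = min(peak_start + len(coverage), peak_max + flank + 1)
    return peak_max, start, end


# Toy coverage profile for a 10 bp peak starting at coordinate 1000.
coverage = [1, 2, 5, 9, 14, 9, 4, 2, 1, 0]
peak_max, start, end = peak_max_window(coverage, peak_start=1000, flank=3)
```

Sequences extracted from such windows would then be passed to a motif discovery tool; that step is not reproduced here.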

As in previous JASPAR CORE additions, we manually
curated the profiles. To confirm the putative binding
patterns, we identified independent publications with
TFBSs or profiles consistent with the candidates, as
described in (7). To gain additional profiles, we considered
bound regions derived from ChIP-chip experiments from
modENCODE and (19) for D. melanogaster. A strategy
similar to that used for ChIP-seq datasets was applied to
derive PFMs from ChIP-chip data (see Supplementary Text
for details). In total, we obtained 45, 28, 8 and 10
high-quality PFMs in H. sapiens, M. musculus,
D. melanogaster and C. elegans, respectively, for TFs that
have never been described previously in JASPAR (see
Supplementary Table S1). This represents a 57, 6 and 200%
increase when compared with the previous release for
vertebrates, insects and nematodes, respectively. The newly
introduced vertebrate profiles are derived from 34 and 40
ChIP-seq experiments collected from PAZAR and ENCODE,
respectively. The fact that almost 50% of the new PFMs
are from individual studies collected in PAZAR highlights
the importance of our manual retrieval of published
ChIP-seq data. From ChIP-seq data sets of the vertebrate
sequence-specific TFs not previously described in
JASPAR, we obtained 71 (~60%) canonical motifs
satisfying our literature-based manual curation (see
Supplementary Table S2). The rich data from ChIP
experiments allowed replacement of 39 existing profiles for
TFs in mammals (36 PFMs updated) and in
D. melanogaster (3 PFMs updated).
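The quoted growth figures can be checked against the counts in Table 1 with a short calculation. The helper name is hypothetical; the counts are those listed in the table for JASPAR 4.0 and for the new profiles in JASPAR 5.0.

```python
def percent_increase(new, previous):
    """Growth of a subset, as a percentage of its previous size."""
    return 100.0 * new / previous


# subset: (non-redundant profiles in JASPAR 4.0, new profiles in JASPAR 5.0)
table1 = {
    "Vertebrates": (130, 74),
    "Insects": (123, 8),
    "Nematodes": (5, 10),
    "Total": (457, 135),
}

increase = {
    subset: round(percent_increase(new, old), 1)
    for subset, (old, new) in table1.items()
}
```

With these inputs the vertebrate, nematode and total values come to 56.9, 200.0 and 29.5, in line with the 57, 200 and 30% stated in the text; the insect value comes to 6.5, which the text gives as 6.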

As part of the curation of ChIP-seq data, and as
introduced earlier, we computed a centrality score as
described in (20), based on our expectation that the
positions where the maximum number of ChIP-seq reads map
on the reference genome will be strongly enriched for
binding sites corresponding to the ChIPed TF (21). We
provide the centrality plot and log(P-value) for each
newly introduced PFM in vertebrates (see Figure 1),
showing the propensity of the motif to be found close to
the peak-max position in the corresponding peaks of the
ChIP-seq dataset used to generate the profile (see
Supplementary Figure S1). The high quality of the
vertebrate PFMs and the ChIP-seq datasets used to construct
them is reflected by the low centrality log(P-values),
which are all below -200, with the exception of the
Bach1::Mafk, ESRRA, FOXP1, FOXP2, Hoxa9, Sox6,
SP2, SREBF1, SREBF2, and THAP1 binding profiles
(see Supplementary Table S1).

Table 1. Summary of content and growth of the JASPAR CORE database

Subset        Number of        New              Updated    Removed    Total profiles      Total profiles
              non-redundant    non-redundant    profiles   profiles   (including older    (non-redundant)
              profiles in      profiles in                            versions of
              JASPAR 4.0       JASPAR 5.0                             profiles)

Vertebrates   130              74               36         1          260                 202
Plants        21               43               3          -          67                  64
Insects       123              8                4          1          136                 131
Nematodes     5                10               -          -          15                  15
Fungi         177              -                -          -          177                 177
Urochordata   1                -                -          -          1                   1
Total         457              135              43         2          656                 590

(- indicates no entry.)

Figure 1. Screenshot of an example TFBS profile in new layout.

Moreover, we expanded the collection of PFMs for
Arabidopsis thaliana TFs in JASPAR, with the first
targeted JASPAR curation effort for plant TFs. We
have included 43 new DNA-binding profiles for
A. thaliana TFs, more than tripling the plant content in
JASPAR CORE, and we updated three previous PFMs.
The profiles are derived from in vitro and in vivo
experiments (8 new profiles are constructed from
ChIP-seq experiments, 8 from ChIP-chip experiments, 6 from
protein binding microarray experiments and 24 from
SELEX experiments).

MODELS FOR DUAL BINDING BY THE SAME TF 

In this release, in a few select cases, we introduce
multiple binding profiles for the same TF, motivated by
the fact that some TFs display diverse target specificity
that cannot be represented using a single PFM model.
For instance, JUND has been previously shown to bind
DNA with motifs of flexible lengths (22) with a core
composed of either TGACGTCA or TGAC/GTCA,
where C/G stands for C or G. The two new profiles
introduced for JUND (see Figure 2A) are derived from the
same ChIP-seq dataset, confirming the binding to the two
subclasses. Similarly, we introduce two profiles for JUN
(see Figure 2B), displaying equivalent characteristics to
the JUND profiles. A new profile for Nkx2-5 (see Figure
2C) derived from ChIP-seq data has been introduced. It
differs substantially from an in vitro SELEX
experiment-based profile but has been confirmed to reflect
binding properties of Nkx2-5 (23). Finally, we introduce two
binding profiles associated with the plant TF RAV1, as it
can bind to two unrelated motifs by using two distinct
DNA-binding domains (24) (see Figure 2D). The
philosophy of maintaining JASPAR as a non-redundant
collection remains a driving approach to curation. In these
special cases in which we allow unique pairs of profiles
for the same TF, the TF presents distinct binding
capacities that cannot be captured within a single PFM.

Figure 2. TFBSs with two different profiles. (A) JUN, (B) JUND, (C) Nkx2-5 and (D) RAV1.
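The two JUND core variants differ in length (8 bp for TGACGTCA, 7 bp for TGAC/GTCA), which is why a single fixed-width matrix cannot cover both. The sketch below locates both cores with exact string matching; this is a deliberate simplification for illustration, since the JASPAR profiles themselves are matrices, and the example sequence is invented.

```python
import re

# The 8 bp core, or the 7 bp core in which the central base is C or G.
# A lookahead is used so that overlapping occurrences are all reported.
CORE = re.compile(r"(?=(TGACGTCA|TGA[CG]TCA))")


def find_cores(sequence):
    """Return (start, matched core) for every core occurrence in sequence."""
    return [(m.start(), m.group(1)) for m in CORE.finditer(sequence.upper())]


# Hypothetical sequence containing one 8 bp core and two 7 bp cores.
hits = find_cores("ccTGACGTCAggTGAGTCAttTGACTCA")
```

The hits have two different widths, so aligning them in a single occurrence table would require inserting or dropping a column, which a plain PFM does not allow.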



ENHANCED WEB INTERFACE AND NEW 
RESOURCES FOR POWER USERS 

For casual users, we have enhanced the web search
interface to the JASPAR database. Fuzzy searching now
retrieves one or multiple profiles by gene name,
official or common species name, protein accession ID,
DNA-binding domain family or class, experiment type
(e.g. ChIP-seq) and any other keyword associated with the
profile(s) in the underlying database. This fuzzy searching
performs approximate string matching in case-insensitive
mode and offers suggestions below the search box while
typing. It also includes the gene name aliases from HGNC
(PMID: 23161694) for searching gene synonyms.
Furthermore, for each TF profile, we have now included
links to the Transcription Factor Encyclopedia (25) and to
the protein structures from the Protein Data Bank when
available (26). Each binding profile links to the
corresponding TFBSshape profile of DNA structural analysis
(27).
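Case-insensitive approximate string matching of the kind described above can be sketched with the Python standard library. The text does not state which algorithm the web interface uses, so difflib serves here only as a stand-in, and the function name, cutoff and list of names are assumptions.

```python
import difflib


def fuzzy_search(query, names, limit=5, cutoff=0.6):
    """Suggest names that approximately match query, ignoring case.

    limit and cutoff are illustrative defaults, not JASPAR settings.
    """
    # Map lower-cased names back to their original spelling.
    lowered = {}
    for name in names:
        lowered.setdefault(name.lower(), name)
    close = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[key] for key in close]


# A few TF names mentioned in this article.
tf_names = ["JUND", "JUN", "Nkx2-5", "FOXP1", "FOXP2", "Sox6", "SREBF1"]
suggestions = fuzzy_search("foxp", tf_names)
```

A query such as "nkx25" would still suggest Nkx2-5 despite the missing hyphen, which is the behaviour approximate matching is meant to provide.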

For power users, we have developed an open source
Python package (freely available at
https://github.com/biopython/) within the extensively used
tools of the BioPython Project (28). We implemented the
jaspar package as part of the 'motifs' BioPython package,
which provides functions such as reading profiles,
writing profiles, scanning sequences for motif instances
and more. The specific jaspar 'motif' class stores all
the metadata related to the profiles in JASPAR, and
specific functions allow the user to retrieve profiles
from the database. We also developed an R/Bioconductor
(PMID: 15461798) software package, TFBSTools, available at
http://www.bioconductor.org/packages/2.13/bioc/html/TFBSTools.html
under the General Public License-2 (GPL-2), to provide
developers with handy tools to generate, read and convert
the JASPAR template, an internal data format to describe
each motif instance and its meta information. An
R/Bioconductor (29) data package, JASPAR2014Data, is
freely available at
http://www.bioconductor.org/packages/devel/data/experiment/html/JASPAR2014.html
to provide users with tools for data analysis using the
JASPAR profiles.
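To make the idea of scanning sequences for motif instances concrete, the following pure-Python sketch slides a log-odds matrix along a sequence and reports the windows that reach a threshold. It deliberately does not use the BioPython or TFBSTools APIs, whose calls are not described in this text; the function name, weights and threshold are hypothetical.

```python
def scan(sequence, matrix, threshold):
    """Report (start, window, score) for windows scoring >= threshold.

    matrix is a list with one dict per motif position, mapping each
    nucleotide to its weight. Windows containing other symbols
    (for example N) are skipped.
    """
    width = len(matrix)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue
        total = sum(matrix[i][base] for i, base in enumerate(window))
        if total >= threshold:
            hits.append((start, window, total))
    return hits


# Hypothetical log-odds weights for a 4 bp motif favouring TGAC.
toy_matrix = [
    {"A": -2.0, "C": -2.0, "G": -2.0, "T": 2.0},
    {"A": -2.0, "C": -2.0, "G": 2.0, "T": -2.0},
    {"A": 2.0, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": 2.0, "G": -2.0, "T": -2.0},
]
hits = scan("aaTGACggNTGACtt", toy_matrix, threshold=6.0)
```

A complete scanner would usually also score the reverse complement strand; that is omitted here to keep the sketch short.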

In addition, a web-based curator interface was
developed for JASPAR, focusing on giving super-users
the ability to edit and update the database; this capacity is
released for users wishing to produce custom PFM
databases using the JASPAR framework.

CONCLUSIONS AND FUTURE DEVELOPMENTS 

In this release of JASPAR, we have focused on the CORE
database and expanded it primarily with new ChIP-based
data. Although these types of expansions are important
and will continue, the increasing availability of rich data
sources highlights important questions for the future
development of JASPAR, which need to be discussed with
its user base. Two such larger questions are as follows.

Non-redundancy versus species-specific matrix models? 

JASPAR CORE was originally designed with the clear
goal of finding the 'best' PFM for a certain TF, unlike
other databases that can hold several models for the
same factor. Although many users have appreciated the
clarity, it is not established how to resolve cases where the
same factor has been characterized in-depth in two or
more species. While this situation was rare in the early
JASPAR versions, new experimental methods allow
binding specificity to be probed in several species with
comparative ease (30). In general, the binding specificity for
orthologous TFs rarely changes to a substantial degree,
but exceptions exist (31). Thus, future curation of
JASPAR will have to resolve whether the non-redundancy
approach should be within each species or within larger
clades.



New types of models? 

Likewise, the sheer number of sites that the new
laboratory methods generate provides sufficient information to
produce predictive models that address more aspects than
can be readily handled within the classic PWM
framework, in particular dependencies between positions and
variable-length motifs, which basic PWM models ignore.
Here, one will have to consider the trade-off between
possible higher specificity in binding predictions [see (32)
for a detailed discussion] and the comfort of the
community with the simpler PWM models. It is our plan to
introduce newly designed Transcription Factor Flexible
Models (33) derived from ChIP-seq data within
JASPAR in the near future.

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online.

APPENDIX

During the production process, we analyzed the recently
published ChIP-seq data sets from (PMID: 23953112).
Three new profiles resulted and have been added to the
new release of JASPAR. This late addition is not covered
in the manuscript.



